Shiny app link: https://groupa-autumnclass.shinyapps.io/GroupProject-Shiny/

1. Business case

The Covid event created new incentives for city dwellers to move from condos to detached either in or outside Toronto.

In this context we propose a simple tool for getting informed on the transactions that have happened in Toronto with the purpose of both:

  1. estimating of the market value of a property in Toronto given location and other user defined features

  2. and, given budget and other constraints,visualizing the properties sold.

At the moment, given the project time budget constraint, even though we have been able to web scrape the latest available listings on zoocasa but did not have sufficient time to clean and process the data. Instead, to illustrate the business case solution, we used a pre-existing dataset available as at H2 2019 which was alreadty cleaned and processed.

2. Project data

2.1 Data sources

The current project builds upon a pre-existing project on Toronto housing price prediction available at https://github.com/slavaspirin/Toronto-housing-price-prediction

The dataset used from the pre-existing project is available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv

The data from the pre-existing project is based on web scarping of Zoocasa listings of previously sold properties. Unfortunately the data does not have a time stamp. We understand that the primary listing data was scraped from https://www.zoocasa.com and contains a list of sold properties available as sometimes in H2 2019.

We were able to scrap a scoring of Toronto neighborhoods from https://torontolife.com/neighbourhood-rankings/ to complemet the average 2016 personal income data of each district that was already available in the pre-existing project.

Toronto neighborhoods data for geographical mapping was available from https://open.toronto.ca/dataset/neighbourhoods/.

We were also able to obtain location data for the Toronto subway stations from https://scruss.com/blog/2005/12/14/toronto-subway-station-gps-locations/#comments. For the line currently in construction, Line 5 Eglington, the subway stations latitude and longitude were obtained from https://en.wikipedia.org/wiki/Line_5_Eglinton.

2.2. Data description

2.2.1 Zoocasa listings in the pre-existing project dataset

The data available includes 15234 listings of Toronto properties with the following available features.

Variable Name Description
title text, Zoocasa short description of the listing
final price numeric,sale price
listed price numeric, listed price
bedrooms text ordinal, 0 beds, 0 + 1 beds, 1 beds … 9 + 5 beds
bedrooms>grade numeric, number of bedrooms above grade
bedrooms<grade numeric, number of bedrooms below grade
bathrooms mumeric, 1 to 11
sqft Missing or numeric between 259 to 4374
description text, Zoocasa long description of the listed property
mls text, zoocasa identifier
type text categorical, Att/Row/Twnhouse, Comm Element Condo, Condo Apt, Condo Townhouse, Co-Op Apt, Co-Ownership Apt, Detached, Link, Plex, Semi-Detached, Store W/Apt/Offc
full link text,Zoocasa web link
lat numeric, property location latitude
long numeric, property location longitude
city district text, Toronto city district
district code numeric, Toronto city district identifier code
mean district income numeric, Toronto city district average household income based on 2016 statics

Based on the “bedrooms>grade” and “bedrooms<grade” we created an aggregated bedrooms feature calculated as “bedrooms>grade”+bedrooms<grade/2 to account for the smaller size of the below grade bedrooms.

Based on the “listed price” and “final price” we created an “price differential” feature calculated as “final price”/“listed price” - 1. Even though such a feature is not necessarily useful for predicting the sale price, it is informative with respect to the pricing error that property sellers have encountered and may be informed upon with when listing a property.

2.2.2 Toronto subway - walking distance to the closest subway station

Based on location data for the Toronto subway stations (including Line 5 Eglinton) we were able to estimate the closest subway station to a property and estimate, assuming and average walking speed of 5 km/h the walking distance to the closest subway station for each property.

2.2.3 Toronto neighbourhood rankings

Variable Name Description
district code numeric,Toronto city district identifier code
area name text,Toronto city district name
description text,description of the district
housing score numeric, score based on affordability (cost vs. income), appreciation (yoy change) and rate of home ownership
safety score numeric, score based on number of crimes
transit score numeric, score based on number of TTC stops, walk and transit scores, commuting times, numbers of commuters who walk, cycle or take TTC
shopping score numeric, score based on number of groceries, markets and pharmacies per km2
health score numeric, score based on number of medical and mental health services per capita, number of senior care services per senior, number of people with family doctors and physical activity levels among residents
entertainment score score based on numeric,number of gyms, sport facilities, bars and restaurants per km2
community score numeric, score based on voter turnout, community space use per capita, how many people report a sense of community belonging
education score numeric, score based on number of schools per child, number of daycares per child, share of residents with post-secondary education
diversity score numeric, score based on % of visible minorities , people whose mother tongues are not French or English, and first- and second generation immigrants
employment score numeric, score based on employment and unemployment rates, the share of residents below the poverty line, the share of high income residents and the share of self employed residents

2.3 Data preparation

The data pertaining to housing listings was cleaned,aggregated and readily available on https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv

2.4 Descriptive statistics

2.4.1. Property sale price distribution

The histogram of sale prices and log sale prices for the most condos, townhouses, detached and semi-detached(the inclusion criteria captures 94% of our listings) indicate that the log transforms shift the distribution closer to a Gaussian one.

One can also observe that the sale relative to listed price has a positively skewed distribution when above zero, meaning that sellers were getting more than listed. Nevertheless, the very high values observed (50% mark up) make us less inclined to use the listed price in our analysis. It is possible there may be a bias related to increasing the marketability of the property and gathering more offers during the listing period.

2.4.2. Properties features characteristics

An interesting property of condos is related to the subway walking distance feature. Condos are concentrated within less than 39 minutes to the closest subway station as the corresponding histogram abruptly drops around the 39 minutes threshold.

2.4.3. Toronto districts descriptive statistics

The Toronto district scores seem to be designed to be uniformly distributed, in contrast with the district average income which seems to be non-uniform.

Given the nature of the district average income having a different distribution that the districts scores we use the Spearman (rank) correlation.

One can observe the following strong correlations:

  1. the diversity score is strongly negatively correlated with the employment score and the average district income,
  2. the employment score is strongly positively correlated with the average district income and
  3. transit, shopping end entertainment scores are highly correlated.

We have also computed the Pearson correlations and observed that the most extreme values was 0.83 (shopping vs. entertainment scores). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).

2.4.4. Correlation - sale price and properties features vs. with Toronto’s districts scores and average income

We have computed the Pearson correlations to observe how all property and location features are interconnected. As expected, the size of the apartment influences the sale price. In terms of district location, district average income and the sores for safety, diversity and employment are also significantly correlated (1% p-value test) with the sale price.

Unexpectedly, the “subway walking distance” to has a negative, small correlation with the final sale price. One of the reasons could be related to differences in this relationship across types of properties:

  1. for condos, subway closeness is much more important than for detached,

  2. for detached houses, the size of the property, and implicitly the price, is higher, the further away one gets from the subway lines.

We note that the most extreme correlation values are below 0.85 (excluding the “bedroom>grade” vs “Bedrooms Agg” correlation). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).

2.4.5. Sale price distribution in relation with various features

2.4.6. Descriptive statistics by geographical district

The following mappings attempt to provide an overview of the listings available as wel as the relationship between districts and house prices.

One can observe that the number of listings not uniformly concentrated across Toronto. With respect to housing price levels, both the detached and condo market seem to be exhibiting a similar geographical distribution with the district average income, confirming the positive correlation between prices and district average income.

3 Modeling

To predict the house prices we use the data set mentioned in the data sources section as our training data to build a machine learning model. Once the model is built and evaluated we will deploy this to the shiny app for predicting the price of an user input house. Given the time constraint we did not include the “walking distance to subway” feature which would have required more time for implementing in the shiny app.

3.1 Linear Regression

Linear regression is a basic and very commonly used predictive algorithm. Two main types of information can be obtained with linear regression, the extent at which a set of independent variables (X) can predict a dependent variable (Y), and which of the variables are significantly related to Y.

In it’s simplest form a linear regression can be described by the formula y = c + b * x, where y is the dependent variable, c is a constant (y-intercept), b is the regression coefficient (slope), and x is the independent, or predictive variable. Linear regression assumes linearity between X and the mean of Y, homoskedasticity, meaning that the variance of the residual is the constant and is independent of Y and the predictive variables X.

Linear regression can accommodate both numerical and categorical explanatory variables (the categorical predictors are transformed into dummy variables).

In this section we will build a multiple linear regression model for predicting the response - final_price, regressed on the remaining predictor variables in the data set. We will use the least squares minimization approach in this project.

We removed the variables from our data frames that are not relevant to the project.

Based on our exploratory data analysis, the detached and non-detached markets show distinctively separate dynamics in their behavior in the context of this project. For e.g detached houses show large variations in the price, the absolute price ranges are wide and higher and correlation between other variables show different trends. So to optimize our results we approach by developing 2 separate models - one for the detached market and non-detached market. Note: The non-detached market includes all other types of housing including semi-detached, condos, freeholds, townhouses etc.

3.1.1 Missing data imputation

We see that the predictor sqft is missing data for many observations. We will impute this missing data using multivariate imputation by chained equations (MICE) technique. We use the “mice” package in R for this purpose and use the Random Forest method with default imputation cycles (5).

To make the best estimations of the missing sqft values, we run the imputation process separately on the detached and non-detached houses.

Perform a sanity check of imputation and then apply the imputed values to the dataframe.

We can also apply the semi-parametric imputation approach Predictive Mean Matching (PMM) for the same.

PMM calculations involve several steps. First, a linear regression model is estimated to predict the feature Y that contains the missing values, whereby only the observed instances are used. Second, the predicted values for the observed and missing values are calculated. Third, for each instance where Y is missing, the closest predicted values among the observed instances are found, and finally, one of these close instances is randomly chosen and the missing value is imputed with the observed value from this close instance.

PMM also allows for discrete target variables, and has been proven to be a versatile and robust method.

3.1.2 Model Development

Multi-colinearity

It is important to understand the correlations between the variables, including both predictors and response to improve the accuracy of the model.

The variable correlations available in the descriptive statistics as well as the correlations matrix for our detached and non-detached, below, show that certain groups of variables are strongly correlated but not as high as to pose problems of multi-colinearity.

In addition, we address indirectly the strong correlation issues through a variable selection approach as described into the next section.

Detached house variable correlations ordered based on principal components

Non-detached house variable correlations ordered based on principal components

Variable selection

We started with the method of ** backward selection** i.e start with all the variables in the model and remove the variables that shows least statistical significance (p-values).

We added variables previously removed based on their observed relationship with the sale price. Thus we will eventually be adopting a mixed selection approach for variables.

All categorical variables are be factored using the default dummy variable encoding in R.

Outliers and high-leverage observations with measure of influence

We used the Cook’s distance to detect observations that strongly influence the model. We also looked at the leverage vs. the residual plots to identify outliers or observations which makes a significant impact at the cost of reducing the overall accuracy.

During the iterative model building process we found that in the detached housing segment, properties with sale price of over $4M have very few observations but have a bigger impact by reducing the overall accuracy of the model. So we treat those observations as outliers and remove them when we train the model.

Extending the linear model for Additive and Linear assumptions

Two important assumptions in a linear regression model are that relationship between the predictors and response are additive and linear. This is almost never the case in real use cases as in our project. This shortcoming can be partially addressed by transforming numerical variables into categorical ones but it depends on a subjective approach or business information how the binning should be conducted.

Also, to recognize the non-linear effect due to interactions between the explanatory variable, we incorporated interaction effects between variables. We have also introduced non-linear effects where a non-linear relationship was observed between a predictor and the sale price.

We test a range of effects and finally retain the best effects in the model that provide significant improvement to the model fit.

3.1.3 Model estimation and diagnostics

Distribution of final sale price across all the markets.

Most of the variance is due to the detached market .

Detached market model

The proposed model recognizes the non-linear effect due to interactions between the explanatory variable through the interaction effects between variables.

We have also incorporated non-linear effects based on the non-linear relationship was observed between a predictor and the sale price.

## 
## Call:
## lm(formula = final_price/1000 ~ ., data = detached_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1445.99  -138.77    -7.39   112.60  2128.15 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -2.031e+04  3.019e+03  -6.728 1.94e-11 ***
## bathrooms                          1.416e+02  1.204e+01  11.757  < 2e-16 ***
## sqft                              -1.691e-01  1.831e-02  -9.237  < 2e-16 ***
## parking                            1.955e+01  2.960e+00   6.604 4.48e-11 ***
## long                              -2.345e+02  3.806e+01  -6.162 7.87e-10 ***
## mean_district_income              -1.623e-02  1.028e-03 -15.788  < 2e-16 ***
## bedrooms_ag                        2.108e+02  1.471e+01  14.332  < 2e-16 ***
## housing                           -2.598e+00  2.242e-01 -11.584  < 2e-16 ***
## safety                             9.681e-01  1.940e-01   4.991 6.25e-07 ***
## transit                            2.151e+00  2.804e-01   7.670 2.11e-14 ***
## entertainment                      1.811e+00  2.502e-01   7.239 5.35e-13 ***
## community                         -5.399e-01  1.962e-01  -2.752  0.00595 ** 
## education                          8.143e-01  1.836e-01   4.435 9.46e-06 ***
## sqft_x_bathrooms                   6.798e-02  4.762e-03  14.274  < 2e-16 ***
## ibedrooms_ag_x_bathrooms          -5.041e+01  4.167e+00 -12.098  < 2e-16 ***
## sqft_x_mean_district_income        1.811e-06  1.446e-07  12.530  < 2e-16 ***
## bathrooms__squared                 3.211e+00  2.611e+00   1.230  0.21885    
## mean_district_income__square_root  1.214e+01  5.425e-01  22.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.1 on 4271 degrees of freedom
## Multiple R-squared:  0.8129, Adjusted R-squared:  0.8122 
## F-statistic:  1092 on 17 and 4271 DF,  p-value: < 2.2e-16

Non-detached market model

The proposed model recognizes the non-linear effect due to interactions between the explanatory variable through the interaction effects between variables.

## 
## Call:
## lm(formula = final_price/1000 ~ . + sqft:mean_district_income + 
##     sqft:type + bedrooms_ag:bathrooms + sqft:bathrooms + I(mean_district_income^0.5), 
##     data = nondetached_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1574.2   -81.8    -9.9    69.2  3699.1 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.693e+04  4.793e+03   7.706 1.41e-14 ***
## bathrooms                    5.649e+00  6.891e+00   0.820 0.412394    
## sqft                        -3.339e-01  2.650e-02 -12.597  < 2e-16 ***
## parking                      1.911e+01  2.558e+00   7.472 8.48e-14 ***
## typeCo-Op Apt               -4.501e+02  9.599e+01  -4.689 2.78e-06 ***
## typeCo-Ownership Apt        -7.323e+02  1.254e+02  -5.839 5.41e-09 ***
## typeComm Element Condo      -6.248e+02  5.089e+01 -12.277  < 2e-16 ***
## typeCondo Apt               -6.566e+02  3.487e+01 -18.829  < 2e-16 ***
## typeCondo Townhouse         -3.671e+02  3.813e+01  -9.629  < 2e-16 ***
## typeLink                     1.863e+02  1.489e+02   1.251 0.211013    
## typePlex                     5.645e+02  7.471e+01   7.556 4.50e-14 ***
## typeSemi-Detached           -2.542e+01  3.683e+01  -0.690 0.490063    
## typeStore W/Apt/Offc        -3.706e+02  1.713e+02  -2.163 0.030532 *  
## lat                         -6.914e+02  7.300e+01  -9.472  < 2e-16 ***
## long                         8.963e+01  2.657e+01   3.373 0.000746 ***
## mean_district_income        -1.389e-02  4.412e-04 -31.479  < 2e-16 ***
## bedrooms_ag                  1.929e+01  6.231e+00   3.095 0.001972 ** 
## bedrooms_bg                  8.336e+00  3.364e+00   2.478 0.013219 *  
## housing                     -1.327e+00  1.022e-01 -12.983  < 2e-16 ***
## safety                       2.469e-01  8.838e-02   2.794 0.005218 ** 
## transit                      1.802e+00  1.319e-01  13.662  < 2e-16 ***
## shopping                     7.394e-01  1.459e-01   5.068 4.09e-07 ***
## health                       1.563e-01  7.708e-02   2.027 0.042671 *  
## entertainment                1.385e+00  1.637e-01   8.461  < 2e-16 ***
## diversity                    1.146e+00  1.605e-01   7.142 9.76e-13 ***
## education                   -3.488e-01  8.500e-02  -4.104 4.09e-05 ***
## I(mean_district_income^0.5)  7.654e+00  2.582e-01  29.642  < 2e-16 ***
## sqft:mean_district_income    4.141e-06  9.925e-08  41.727  < 2e-16 ***
## sqft:typeCo-Op Apt           1.126e-01  9.120e-02   1.235 0.216827    
## sqft:typeCo-Ownership Apt    2.765e-01  1.688e-01   1.637 0.101575    
## sqft:typeComm Element Condo  3.990e-01  4.964e-02   8.037 1.02e-15 ***
## sqft:typeCondo Apt           4.333e-01  2.093e-02  20.703  < 2e-16 ***
## sqft:typeCondo Townhouse     1.539e-01  2.359e-02   6.525 7.10e-11 ***
## sqft:typeLink               -9.640e-02  8.729e-02  -1.104 0.269494    
## sqft:typePlex               -1.812e-01  3.424e-02  -5.292 1.23e-07 ***
## sqft:typeSemi-Detached       6.044e-02  2.110e-02   2.864 0.004188 ** 
## sqft:typeStore W/Apt/Offc    1.945e-01  8.760e-02   2.220 0.026435 *  
## bathrooms:bedrooms_ag       -1.210e+01  2.021e+00  -5.990 2.17e-09 ***
## bathrooms:sqft               7.991e-02  5.843e-03  13.675  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 178.7 on 10832 degrees of freedom
## Multiple R-squared:  0.7766, Adjusted R-squared:  0.7758 
## F-statistic: 990.8 on 38 and 10832 DF,  p-value: < 2.2e-16

Models diagnostics

For both the detached and non-detached segments the model diagnostics statistics shows that there is still a remaining non-linear relationship between the estimated level of the sale price and the model estimated residuals as shown in the “Residuals vs fitted” and the “Scale-Location” plots where the volatility of the residuals increases as the estimated level of the sale price increases.

In addition, one can observe from the “Normal q-q” plots that the residuals are not normally distributed, with extreme divergence beyond +/- 2 stdev.

Also the residuals standard errors are quite high, around 270 thousand CAD for detached and 180 thousand CAD for non-detached, sizes which are acceptable for an average property price of 2 million CAD but less acceptable for a property valued at around 700 thousand CAD.

As this is a first, time-constrained attempt at modeling property sale price data, we choose to implement these model for illustrative purposed in our shiny app. On the other hand, we are fully aware that more research needs to be done on model selection, especially if we are aiming to find a model that is easy to implement for on-the-fly evaluation.

3.1.4. Model performance . An in-sample vs-out-of sample test

We have already evaluated the accuracy of coefficients while iterating through the model building process and selected the variables with significant statistical importance determined by the p-values.

Note: When interaction effects are employed, some of the base variables (main effects) may show reduced significance but that does not reduce the importance of the variable if the interaction effects themselves rank high. The base variables cannot be removed in such cases as per the hierarchical principle.

We evaluate the accuracy of the models in the below steps using MSE and R2 statistics based on an out-of-sample estimation test. We split the detached and non-detached data we used into 2 training and test data sets respectively. We retrain the same models using the new training subset and test them on the test data.

Detached model

MSE Statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset

## [1] "252.6  thousand CAD"

R2 Statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset

## [1] "80.4 %"

Non-detached model

MSE statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset

## [1] "145.5  thousand CAD"

R2 statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset

## [1] "75.1 %"

4. Random Forest Model

In our project, we used the random forest model as a comparison to the linear regression model with respect to the selection of explanatory variables, estimation errors and model fit.

Random Forest is an ensembling machine learning algorithm which consists in generating multiple decision trees based on random sampling of the data and the predictor variables and then combining their output. for every explanatory variable a classification is generated. Based on belonging to specific classes of a randomly selected set of variable the decision three is built. The user decides how many variables are used in constructing the decision tree. The construction of the classes for each variables accommodates both categorical and numerical variables (continuous or discontinuous). The random sampling serves to de-correlate the trees and subsequently reduce the Variance by averaging them and avoid overfitting. The user decides how many trees are used for model averaging. The random forest model needs to be tuned with respect to the number of decision trees and the number of variables randomly sampled at each stage.

Random Forest model estimation results

One can observe that the distance to subway, number of bathrooms and average district income occur across all random forest models.

For detached, district scores on community, education and transit appear in addition to the above

For condos, district scores on shopping and health appear in addition to the above.

Random forest R-squared and MSE as a function of the number of trees. For each model the number of variables randomly sampled are selected based on maximizing the R2 and MSE comparison across across a 1 to 15 range.

5. Deployment and further developments

The model and the shiny app could be used by a real estate company to provide potential customers with a self-serving tool that they can use prior to contacting the company.

The shiny app published at https://groupa-autumnclass.shinyapps.io/GroupProject-Shiny/ has two features available for the customer:

  1. the first feature enables, potential sellers to obtain a ballpark figure value of what their house might be worth on the market.

  2. the second feature enables, potential buyers to visualize properties sold, based on their specific criteria (available budget, required square footage, number of bedrooms/bathrooms, etc.).

The tool has potential for improvement in three directions: model estimation, additional explanatory factors and improving the shiny app.

Model estimation

For both the detached and non-detached segments the model diagnostics statistics show that there is still a remaining non-linear relationship between the estimated level of the sale price and the model estimated residuals.Also, the residuals standard errors are quite high, around 270 thousand CAD for detached and 180 thousand CAD for non-detached, sizes which are acceptable for an average property price of 2 million CAD but less acceptable for a property valued at around 700 thousand CAD.

The model can be enhanced by:

  1. applying a transform to the sale price
  2. a deeper research into binning the sale price and some of the explanatory variables to see if certain linear relationships we assumed can be transformed as such into a less constrained interaction effect,
  3. estimating, in addition to the random forest, a set of non-parametric models that could better model the properties that are outside the +/- 2 stdev range.
  4. model averaging

Additional explanatory factors

In it’s current version the district-related predictors (neighborhood rankings) are not necessarily accurate representations of the attractiveness of the immediate neighborhood for the buyer as Toronto districts are relatively large and heterogeneous. Each individual customer has different preferences in terms of what neighborhood criteria are more or less important to them.

Given the time constraint we did not include the “walking distance to subway” feature as it required more time for implementing in the shiny app. The feature was detected as important by the random forest algorithm and should be included at the nest project iteration.

Finally, if environmental parameters, such as closeness to large parks and Lake Ontario, as well as neighborhood quietness and distance to major arterial roads (noise/air pollution) could be included, this tool would be potentially very valuable.

Shiny app

Potential improvements we have already discussed but did not have time to implement:

  1. adding the estimated price standard-deviation when estimating the price for a property

  2. adding a sales price histogram to visualize the range of prices when visualizing the properties sold, based on user-defined criteria

  3. adding a filter option based on a geographical radius around an address on the map for the user to visualize the properties within a certain radius

  4. adding subway lines and the subway proximity criteria